Vinho Verde Red Wine Exploration by Ronan Casey

This report explores a dataset containing subjective quality assessment scores and contributing physicochemical test measurements for approximately 1,600 red wines of the Vinho Verde variety.

Univariate Plots Section

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Our dataset consists of 12 variables, with approximately 1,600 observations.
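The structure and summary output above can be reproduced with a few base R calls. The sketch below is illustrative only: it rebuilds the first ten rows of three columns from the `str()` output, and the name `wine_head` and the commented-out file path are my own assumptions, not part of the original analysis.

```r
# First ten rows, copied from the str() output above (three columns only).
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# With the full data one would read the file instead, for example:
# wine <- read.csv("path/to/red_wine.csv")  # hypothetical path

print(dim(wine_head))      # number of rows and columns
str(wine_head)             # column types and leading values
print(summary(wine_head))  # quartiles, mean and range per column
```

On the full data frame the same three calls would give the 1599 x 12 dimensions and the per-variable summaries shown above.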

The distribution of perceived quality is approximately normal, with scores out of 10 ranging from 3 to 8.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## 
##  7.2  7.1  7.8  7.5    7  7.7  6.8  7.6  8.2  7.3  7.4  7.9    8  8.3  6.9 
##   67   57   53   52   50   49   46   46   45   44   44   42   42   40   38 
##  6.6  8.8  8.9  9.1  6.7  8.6  8.1  8.4    9  9.9  6.4  8.7   10  9.3 10.4 
##   37   34   33   29   28   27   26   26   26   26   25   24   23   22   21 
##  6.2  8.5 10.2  6.5  9.4  9.6  6.1  9.2  9.8  5.6 
##   20   19   19   17   17   17   16   16   15   14

The majority of the wines have tartaric acid concentrations from 7g to 10g / dm^3; median 7.9g / dm^3 and mean 8.32g / dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most wines have concentrations of acetic acid from 0.3g to 0.7g / dm^3; median 0.5200g / dm^3 and mean 0.5278g / dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
## FALSE  TRUE 
##  1598     1

Most wines have citric acid concentrations between 0.09g and 0.45g / dm^3. The distribution is skewed right with spikes in observations at 0g, 0.24g and 0.49g. Citric acid is a flavour-enhancing additive. We can see from the histogram that a lot of producers choose not to add it. The 0.24g and 0.49g concentrations are probably just a result of people rounding additions. One producer seems to have doubled down and added a whopping 1g / dm^3.
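Spikes like these are easiest to confirm by counting values directly. A minimal sketch, using only the first ten citric acid values from the `str()` output (so the counts are far smaller than in the full data); the variable names are my own:

```r
# First ten citric acid values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# A FALSE/TRUE count, in the same shape as the table printed above.
zero_counts <- table(citric == 0)
print(zero_counts)

# The most frequent values first, which is how spikes show up.
value_counts <- sort(table(citric), decreasing = TRUE)
print(head(value_counts, 3))
```

Swapping the comparison for `citric == 1` on the full data would presumably give the single `TRUE` seen above, although the exact condition used for that table is an assumption on my part.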

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The vast majority of wines contain between 1g and 3g / dm^3 residual sugar (post fermentation). There is quite a long tail on the histogram with a few outliers containing more than 10g / dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Sodium chloride (salt) is mainly found in concentrations between 0.05g and 0.1g / dm^3. The tail end of the distribution is similar to that of residual sugar. It might be interesting to see if these outliers represent the same observations.

## 
## FALSE  TRUE 
##  1570    29
## 
## FALSE  TRUE 
##  1577    22
##  [1] fixed.acidity        volatile.acidity     citric.acid         
##  [4] residual.sugar       chlorides            free.sulfur.dioxide 
##  [7] total.sulfur.dioxide density              pH                  
## [10] sulphates            alcohol              quality             
## <0 rows> (or 0-length row.names)

There appears to be no direct relationship between the residual sugar and chloride outliers: no wine is an outlier in both variables.
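The overlap check can be sketched as two logical flags combined with `&`. The cut-offs below are hypothetical (the thresholds behind the counts above are not stated in this report), and the data are only the first ten rows from the `str()` output:

```r
# First ten rows of the two variables, copied from the str() output above.
wine_head <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076,
                     0.075, 0.069, 0.065, 0.073, 0.071)
)

# Hypothetical outlier cut-offs, chosen only for illustration.
sugar_cut    <- 6
chloride_cut <- 0.09

sugar_out    <- wine_head$residual.sugar > sugar_cut
chloride_out <- wine_head$chlorides > chloride_cut

print(table(sugar_out))     # how many wines exceed the sugar cut-off
print(table(chloride_out))  # how many wines exceed the chloride cut-off

# Rows flagged in both variables; zero rows means no shared outliers.
both_out <- wine_head[sugar_out & chloride_out, ]
print(both_out)
```

An empty result, printed as `<0 rows>`, is what indicates that the two sets of outliers do not share any observations.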

The distributions of sulphates, free sulfur dioxide and total sulfur dioxide are quite similar. As in my previous inquiry, I am curious to see whether the outliers come from the same observations.

## 
## FALSE  TRUE 
##  1591     8
## 
## FALSE  TRUE 
##  1595     4
## 
## FALSE  TRUE 
##  1597     2
##  [1] fixed.acidity        volatile.acidity     citric.acid         
##  [4] residual.sugar       chlorides            free.sulfur.dioxide 
##  [7] total.sulfur.dioxide density              pH                  
## [10] sulphates            alcohol              quality             
## <0 rows> (or 0-length row.names)

Once again, no observation is an outlier across all of these variables.

## 
##  0.9972  0.9968  0.9976   0.998  0.9962  0.9978  0.9964   0.997  0.9994 
##      36      35      35      29      28      26      25      24      24 
##  0.9966  0.9982  0.9974  0.9984  0.9988  0.9986  0.9969  0.9973  0.9963 
##      23      23      22      20      20      19      18      18      15 
##  0.9955  0.9956  0.9958  0.9979  0.9959   0.996  0.9967  0.9971  0.9987 
##      14      14      14      14      13      13      13      13      12 
##  0.9996 0.99538  0.9965   0.995  0.9961  0.9981  0.9991  0.9998       1 
##      12      11      11      10      10      10      10      10      10 
##  1.0002  0.9948  0.9952 0.99572 
##      10       9       9       9

For density, the most frequently observed values end in an even digit at the fourth decimal place. This accounts for the spikes in the distribution.
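The even-digit pattern can be checked arithmetically. This sketch takes the nine most frequent density values and their counts from the first row of the table above and extracts the fourth decimal digit; the variable names are my own:

```r
# Nine most frequent density values and their counts, from the table above.
top_density <- c(0.9972, 0.9968, 0.9976, 0.998, 0.9962,
                 0.9978, 0.9964, 0.997, 0.9994)
top_counts  <- c(36, 35, 35, 29, 28, 26, 25, 24, 24)

# Scale to whole numbers, then keep the last digit (the fourth decimal).
fourth_decimal <- round(top_density * 1e4) %% 10
ends_even      <- fourth_decimal %% 2 == 0

print(data.frame(density = top_density, count = top_counts,
                 fourth_decimal, ends_even))
```

Rounding before taking the remainder guards against floating-point error in values such as 0.9972 * 10000.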

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Most wines have a pH between 3.2 and 3.5.

## 
##  9.5  9.4  9.8  9.2   10 10.5  9.3  9.6   11  9.7  9.9 10.9 10.1 10.2 10.8 
##  139  103   78   72   67   67   59   59   59   54   49   49   47   46   42 
## 10.4 11.2 10.3 11.3 11.4    9 11.5 11.8 10.6 10.7 11.1  9.1 11.7   12 12.5 
##   41   36   33   32   32   30   30   29   28   27   27   23   23   21   21 
## 11.9 12.8 11.6 12.1 12.4 12.2 12.3 12.7 12.9   14 
##   20   17   15   13   13   12   12    9    9    7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Vinho Verde wine (red variety) is mainly represented with alcohol volumes ranging from 9.5% to 11.5%; median 10.2% and mean 10.42%.

Most of the wines fall between 9% and 11% alcohol by volume with gradually fewer wines of higher alcohol volume. The majority of wines have a pH between 3.2 and 3.4. Fixed acidity is skewed to the right with most wines containing concentrations of tartaric acid at 9g / dm^3 or less.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of 1,599 red wines (Vinho Verde) with 11 scientifically measured variables (numeric) and 1 sensory output variable (integer) in the form of a score (1-10).

Other observations:

  • A surprising number of producers choose not to add citric acid.
  • Most wines have alcohol volume greater than 10%, but the most frequent observations are between 9% and 10% alcohol volume.
  • The median score assigned is 6 and the mean score is 5.636.

What is/are the main feature(s) of interest in your dataset?

The main features of the data set are volatile acidity, alcohol and score. I am interested in establishing a link between score and the other two variables. I also suspect that sulphate levels can, in part, help predict score.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Acidity levels and sulphate levels will probably influence the score. I think alcohol volume and volatile acidity are the biggest predictors.

Did you create any new variables from existing variables in the dataset?

Yes. I categorised the variables citric.acid and volatile.acidity as new factor variables.
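Binning a numeric variable into a factor is typically done with `cut()`. The sketch below is an assumption about how that might look: the break points are borrowed from the volatile acidity quartiles in the summary above (0.39 and 0.64), and the labels and variable names are hypothetical rather than those used in the original analysis.

```r
# First ten volatile acidity values, copied from the str() output above.
volatile <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Hypothetical bins: up to the 1st quartile, between quartiles, above the 3rd.
# Intervals are right-closed by default: (0, 0.39], (0.39, 0.64], (0.64, Inf].
volatile_level <- cut(
  volatile,
  breaks = c(0, 0.39, 0.64, Inf),
  labels = c("Low", "Medium", "High")
)

print(table(volatile_level))
```

The same call with different breaks would apply to citric.acid; note that a lower break of 0 excludes exact zeros unless `include.lowest = TRUE` is set, which matters for a variable with so many zero values.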

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Citric acid was probably the most unusual with so many observations forgoing the addition. The spike around 0.5 for that same variable is also interesting.

Bivariate Plots Section

##                      citric.acid fixed.acidity volatile.acidity
## citric.acid                1.000         0.672           -0.552
## fixed.acidity              0.672         1.000           -0.256
## volatile.acidity          -0.552        -0.256            1.000
## residual.sugar             0.144         0.115            0.002
## chlorides                  0.204         0.094            0.061
## sulphates                  0.313         0.183           -0.261
## total.sulfur.dioxide       0.036        -0.113            0.076
## density                    0.365         0.668            0.022
## alcohol                    0.110        -0.062           -0.202
## quality                    0.226         0.124           -0.391
##                      residual.sugar chlorides sulphates
## citric.acid                   0.144     0.204     0.313
## fixed.acidity                 0.115     0.094     0.183
## volatile.acidity              0.002     0.061    -0.261
## residual.sugar                1.000     0.056     0.006
## chlorides                     0.056     1.000     0.371
## sulphates                     0.006     0.371     1.000
## total.sulfur.dioxide          0.203     0.047     0.043
## density                       0.355     0.201     0.149
## alcohol                       0.042    -0.221     0.094
## quality                       0.014    -0.129     0.251
##                      total.sulfur.dioxide density alcohol quality
## citric.acid                         0.036   0.365   0.110   0.226
## fixed.acidity                      -0.113   0.668  -0.062   0.124
## volatile.acidity                    0.076   0.022  -0.202  -0.391
## residual.sugar                      0.203   0.355   0.042   0.014
## chlorides                           0.047   0.201  -0.221  -0.129
## sulphates                           0.043   0.149   0.094   0.251
## total.sulfur.dioxide                1.000   0.071  -0.206  -0.185
## density                             0.071   1.000  -0.496  -0.175
## alcohol                            -0.206  -0.496   1.000   0.476
## quality                            -0.185  -0.175   0.476   1.000

Density appears to correlate with a number of variables. Fixed acidity is the most notable. Alcohol, citric acid and residual sugar also appear to influence density to some degree.
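A matrix like the one above is what `cor()` returns for a numeric data frame, rounded to three places. A minimal sketch on the first ten rows of three columns from the `str()` output; with so few rows the coefficients will not match the full-data values shown above, and `wine_head` is my own name:

```r
# First ten rows, copied from the str() output above (three columns only).
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations (the default method), rounded for display.
cor_matrix <- round(cor(wine_head), 3)
print(cor_matrix)
```

Pearson is the default method; whether the original table used it or a rank-based alternative is not stated, so that is an assumption here.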

Looking at a subset of the data, fixed acidity, residual sugar and density appear to have little or no impact on the quality of the wine. Alcohol on the other hand has a notable correlation with both density and quality. The next step will take a closer look at the relationships between quality and a few other variables like alcohol, residual sugar and volatile acidity.

As alcohol volume increases, the variance in quality decreases. The vertical lines represent the resolution of the measurement, which is rounded to one decimal place. The relationship between alcohol and quality appears to be linear.
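One way to probe the linearity claim would be a simple least-squares fit. No model was fitted in this report, so the following is only a sketch of what such a check might look like, run on the first ten rows from the `str()` output; the estimates from ten rows say nothing about the full data.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordinary least squares: quality as a linear function of alcohol.
fit <- lm(quality ~ alcohol)

print(coef(fit))               # intercept and slope
print(summary(fit)$r.squared)  # share of variance explained
```

Because quality is an integer score, a linear fit is a rough approximation at best; plotting the residuals against alcohol would be the natural next step.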

The plot is scaled to exclude the top 1% of observed residual sugar values. Most of the wines have residual sugar between 1g and 3g / dm^3. There is no meaningful relationship between residual sugar and quality (correlation 0.014).
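Trimming the top 1% amounts to finding the 99th percentile and keeping values at or below it. A minimal sketch on the first ten residual sugar values from the `str()` output; the variable names are my own, and the original plot may have applied the limit through the plotting library's axis settings instead:

```r
# First ten residual sugar values, copied from the str() output above.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# 99th percentile (R's default interpolation, type 7).
sugar_cap <- quantile(sugar, 0.99)
print(sugar_cap)

# Keep only observations at or below the cap.
sugar_kept <- sugar[sugar <= sugar_cap]
print(length(sugar_kept))
```

With only ten values the cap falls between the two largest observations, so a single value (6.1) is dropped.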

The observations have been scaled like before. The addition of jitter and transparency helps to highlight a clear negative correlation between volatile acidity and quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality correlates most strongly with alcohol volume (0.476), and shows a somewhat weaker negative correlation with volatile acidity (-0.391).

As alcohol level increases, the variance in quality decreases. On the plot representing the relationship between quality and alcohol, the observations become fewer as the alcohol level increases with a noticeable lack of lower scores. The higher score frequency shows a slight increase. The relationship looks to be linear.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Citric acid levels show a decent positive correlation with fixed acidity and a negative correlation with volatile acidity. Fixed acidity seems to be a factor in citric acid additions. The addition of citric acid, while providing enhanced flavour, may also have a curtailing effect on volatile-acid-producing microbes.

What was the strongest relationship you found?

The quality of a wine is moderately and positively correlated with alcohol volume (0.476), and negatively, somewhat less strongly, correlated with volatile acidity (-0.391). No other variable shows a correlation of comparable strength with quality. Since alcohol and volatile acidity are only weakly correlated with each other (-0.202), there is an opportunity to explore both when generating predictive models.

Multivariate Plots Section

When volatile acidity increases, the median quality score decreases. It also appears that alcohol volume percentages of 12 or above are indicative of higher scores which seem to be aided by low volatile acidity.
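Median quality per volatile acidity group can be computed with `tapply()`. The sketch below uses the first ten rows from the `str()` output and hypothetical bins built from the volatile acidity quartiles (0.39 and 0.64); neither the bins nor the names are taken from the original analysis, and ten rows are too few to show the trend described above.

```r
# First ten rows, copied from the str() output above.
volatile <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality  <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Hypothetical grouping of volatile acidity into three levels.
volatile_level <- cut(
  volatile,
  breaks = c(0, 0.39, 0.64, Inf),
  labels = c("Low", "Medium", "High")
)

# Median quality score within each level.
median_quality <- tapply(quality, volatile_level, median)
print(median_quality)
```

On the full data the same call, optionally grouped by an alcohol band as well, would give the medians that the boxplots summarise.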

Citric acid concentration appears to have no impact on the quality of the wine. With the addition of the alcohol variable, it is once again evident that higher alcohol is somewhat associated with higher quality.

If a linear predictive model for quality is to be built, other variables and their correlations with quality will need to be examined.

Fixed acidity appears to have no impact on quality. Volatile acidity in the presence of fixed acidity does not follow any identifiable pattern.

Citric acid levels show some correlation with fixed acidity with higher levels measured as fixed acidity increases. Citric acid has no noticeable impact on quality in the presence of fixed acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Wines with higher volatile acidity have lower median quality scores per alcohol volume. The variance appears quite constant across the groups with high levels showing the least variance between the 1st and 3rd quartiles.

Low levels of citric acid correspond with higher levels of volatile acidity.

Higher citric acid levels correspond with higher levels of fixed acidity but don’t show a meaningful relationship with quality.

Were there any interesting or surprising interactions between features?

The variance in fixed acidity grew as citric acid levels increased, with high levels of citric acid showing perhaps the biggest variance in terms of fixed acidity and quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No models were created.


Final Plots and Summary

Plot One

Description One

The distribution of wine quality is approximately normal. Quality ratings greater than 5 are better represented than those below 5.

Plot Two

Description Two

Wines with the highest levels of volatile acidity bring the median quality score down slightly. In the presence of volatile acidity, alcohol volume plays a less noticeable part in overall quality.

Plot Three

Description Three

This plot shows the lack of clear and obvious causes for predicting quality. The distributions follow no clear pattern to help with establishing a predictive linear model.


Reflection

The Vinho Verde red wine data set contains numeric measurements and sensory output on almost 1,600 Vinho Verde red wines across 12 variables from around 2009. This study started by looking at single variables in the data set, from which I took a deeper look using different plotting choices.

The relationship between quality score and alcohol volume showed some promise early on. I was a little surprised that sulfur dioxide levels had a negligible impact on quality.

With such a small variance on the output variable, building a predictive model may not be that useful. Even the strongest correlations were not particularly strong in this data set. The big takeaway from this examination is that alcohol volume has a noticeable impact in producing higher average scores, and volatile acidity will bring the quality down. However, in the presence of other variables these observations fall short of being reliable. Perceived wine quality appears to be all about striking the right balance rather than executing an exact formulation.